Towards a Scalable and Robust Entity Resolution - Approximate Blocking with Semantic Constraints
Abstract
Entity resolution, or record linkage, is the process of identifying data records, within one or more datasets, that refer to the same real-world entity. To handle large datasets, many real-life applications require scalable, high-quality entity resolution techniques. Blocking techniques help to scale up the entity resolution process. Locality sensitive hashing (LSH) is an approximate blocking approach that hashes objects within a certain distance of each other into the same block with high probability. This technique filters out records with low similarity, thereby decreasing the number of comparisons. However, the traditional approach considers only the textual (string) similarity of records and ignores their semantic similarity and semantic constraints. This project proposes and implements a framework that incorporates semantic constraints into the approximate blocking process to achieve scalable, high-performance entity resolution. First, minhash-based locality sensitive hashing is applied to generate minhash signatures from the textual similarity of records. Then, for the semantic constraints, the domain knowledge of the dataset is extracted into a domain tree, and constraint functions are applied according to a set of predefined rules to generate a set of semantic signatures. Finally, the two sets of signatures are combined to group the records into blocks. The experiments are conducted on the Cora dataset. The results show that this framework makes blocking much more accurate while maintaining high completeness.
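The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the character 3-gram shingling, the banding parameters, the toy `DOMAIN_TREE` taxonomy, and every function name below are assumptions made for illustration. A record is placed in a block only when it agrees with other records on both a band of its minhash signature (textual similarity) and a coarse category taken from the domain tree (a stand-in for the semantic signature).

```python
import hashlib
from collections import defaultdict

# Hypothetical domain tree (child -> parent); the paper's actual tree is not given.
DOMAIN_TREE = {
    "publication": None,
    "article": "publication",
    "journal": "article",
    "proceedings": "article",
    "book": "publication",
    "thesis": "publication",
}

def shingles(text, k=3):
    """Return the set of character k-grams of a whitespace-normalised string."""
    s = " ".join(text.lower().split())
    return {s[i:i + k] for i in range(max(1, len(s) - k + 1))}

def _seeded_hash(seed, token):
    """Deterministic 64-bit hash of a token; the seed selects the hash function."""
    digest = hashlib.blake2b(
        token.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")

def minhash_signature(text, num_hashes=12):
    """Minhash signature: the minimum hash value of the shingle set per seed."""
    grams = shingles(text)
    return tuple(min(_seeded_hash(seed, g) for g in grams)
                 for seed in range(num_hashes))

def textual_keys(signature, bands=4):
    """Split the signature into bands; each (band index, band) is one LSH key."""
    rows = len(signature) // bands
    return [(b, signature[b * rows:(b + 1) * rows]) for b in range(bands)]

def semantic_signature(label):
    """Assumed constraint function: the ancestor of `label` just below the root."""
    path = []
    node = label
    while node is not None:
        path.append(node)
        node = DOMAIN_TREE[node]
    return path[-2] if len(path) >= 2 else path[-1]

def build_blocks(records, bands=4, num_hashes=12):
    """Group record ids that share a semantic signature and at least one band.

    `records` maps a record id to a (text, domain label) pair.
    Blocks holding a single record are dropped, since they yield no comparisons.
    """
    blocks = defaultdict(set)
    for rid, (text, label) in records.items():
        sem = semantic_signature(label)
        for key in textual_keys(minhash_signature(text, num_hashes), bands):
            blocks[(sem, key)].add(rid)
    return {key: ids for key, ids in blocks.items() if len(ids) > 1}
```

In this sketch, two records with near-identical text still land in different blocks when their labels fall under different top-level categories of the tree, which is one simple way a semantic constraint can remove textually plausible but semantically incompatible candidate pairs. Increasing `bands` (with `num_hashes` fixed) makes the textual filter more permissive, trading precision for completeness.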
Similar resources
Towards Scalable Real-Time Entity Resolution using a Similarity-Aware Inverted Index Approach
Most research into entity resolution (also known as record linkage or data matching) has concentrated on the quality of the matching results. In this paper, we focus on matching time and scalability, with the aim to achieve large-scale real-time entity resolution. Traditional entity resolution techniques have assumed the matching of two static databases. In our networked and online world, howev...
Corpus based coreference resolution for Farsi text
"Coreference resolution", or "finding all expressions that refer to the same entity" in a text, is one of the important requirements in natural language processing. Two words are coreferent when both refer to a single entity in the text or the real world. So the main task of coreference resolution systems is to identify terms that refer to a unique entity. A coreference resolution tool could be...
Semantic Constraint and QoS-Aware Large-Scale Web Service Composition
Service-oriented architecture facilitates the running time of interactions by using business integration on the networks. Currently, web services are considered as the best option to provide Internet services. Due to an increasing number of Web users and the complexity of users’ queries, simple and atomic services are not able to meet the needs of users; and to provide complex services, it requ...
Adaptive Candidate Generation for Scalable Edge-discovery Tasks on Data Graphs
Several ‘edge-discovery’ applications over graph-based data models are known to have worst-case quadratic complexity, even if the discovered edges are sparse. One example is the generic link discovery problem between two graphs, which has invited research interest in several communities. Specific versions of this problem include link prediction in social networks, ontology alignment between met...
Intelligent scalable image watermarking robust against progressive DWT-based compression using genetic algorithms
Image watermarking refers to the process of embedding an authentication message, called watermark, into the host image to uniquely identify the ownership. In this paper a novel, intelligent, scalable, robust wavelet-based watermarking approach is proposed. The proposed approach employs a genetic algorithm to find nearly optimal positions to insert watermark. The embedding positions coded as chr...